In [1]:
from IPython.display import HTML

# Toggle button to hide the code from the notebook
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
Out[1]:

Calgary Shared Mobility Pilot Trips Analysis

In [2]:
import pandas as pd
import numpy as np
import plotly
import plotly.express as px
import plotly.graph_objects as go
import datetime as dt
from pathlib import Path
import os
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline



import warnings
warnings.filterwarnings('ignore')

Intro

The City of Calgary ran a shared e-bike and e-scooter pilot program from October 31, 2018 to October 31, 2019.

Approximately 500 dockless electric bicycles, provided by Lime, have been available since October 31, 2018, with 168,000 trips taken and 210,000 km traveled between then and September 30, 2019.

Electric scooters (e-scooters) were first made available on July 12, 2019 and were available until October 31, 2019 (source: TT2019-1374). Both Lime and Bird operated scooter rentals.

Rental data for July 1 to September 30, 2019 was made available through the City of Calgary's open data portal: https://data.calgary.ca/Transportation-Transit/Shared-Mobility-Pilot-Trips/jicz-mxiz

Like you probably do, I had questions:

  • How popular are the bikes/scooters?
  • What are people doing with them? i.e. Where are they going? How far/how long are the trips?
  • Can we guess how much money has been spent on the scooters/bikes?
  • How does the weather impact rental count?
  • Using the data, can we identify a few archetypes of scooter users that explain how different people are using them?

Let's take a look at the data to find out!

The Data

The City of Calgary provided data from 482k trips (ok, actually 482,021!). All trips occurred between July 1 and September 30, 2019.

Data available included:

  • Vehicle Type: e-scooter or e-bike
  • Start Date: The day of the trip
  • Start Hour: Hour the trip was started in 24-hour clock (e.g., 13 is 1:00 pm-1:59 pm, 17 is 5:00 pm-5:59 pm)
  • Trip Distance in m
  • Trip Duration in s
  • Lat/Lon of where the trip started and ended (This was within a 10,000$m^2$ hexagon to anonymize the data)

Some of the other columns are somewhat redundant but helpful for analysis, like naming the hexagon or providing the day of the week.

Weather data was obtained from Environment Canada's website: https://climate.weather.gc.ca/

The temperature ($^{\circ}C$), wind speed (km/h) and weather (sunny, raining, etc.) were available for every hour.

Cleaning

Before starting, I cleaned up the data to make it easier to analyze. The main task was linking the weather data to the scooter data. I also calculated some metrics like speed and aerial distance that will be explained in more detail later. The analysis starts from the cleaned data table.
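
The cleaning script itself isn't part of this notebook, so here is only a minimal sketch of how hourly weather could be linked to each trip. It assumes the trip table has start_date and start_hour columns (as in the open data) and that the weather table has one row per hour. The toy rows and the layout of the weather frame are assumptions made for illustration.

```python
import pandas as pd

# Toy trips: illustrative values only, loosely based on the sample rows
trips = pd.DataFrame({
    "start_date": pd.to_datetime(["2019-08-22", "2019-08-22", "2019-09-13"]),
    "start_hour": [16, 16, 23],
    "trip_distance": [338, 1200, 1092],
})

# Toy hourly weather table (assumed layout: one row per hour)
weather = pd.DataFrame({
    "datetime": pd.to_datetime(["2019-08-22 16:00", "2019-09-13 23:00"]),
    "Temp (°C)": [19.4, 9.7],
    "Weather": ["Clear", "Clear"],
})

# Build an hourly timestamp for each trip, then left-join the weather onto it
trips["datetime"] = trips["start_date"] + pd.to_timedelta(trips["start_hour"], unit="h")
linked = trips.merge(weather, on="datetime", how="left", validate="many_to_one")

print(linked[["datetime", "trip_distance", "Temp (°C)", "Weather"]])
```

A left join keeps every trip even if an hour of weather is missing, and validate="many_to_one" raises an error if the weather table accidentally contains duplicate hours.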

Below is a sample of the final table:

In [3]:
# Read in pre-formatted dataset
project_dir = Path().resolve().parents[0]
file_name = os.path.join(project_dir, 'data', 'final', 'all_data.csv')
all_trips = pd.read_csv(file_name)
all_trips.datetime = pd.to_datetime(all_trips.datetime)
all_trips.start_date = pd.to_datetime(all_trips.start_date)
all_trips.head()
Out[3]:
vehicle_type start_date start_hour start_day start_day_of_week trip_distance trip_duration starting_grid_id ending_grid_id startx ... a_dist travel_efficiency speed a_speed is_weekend is_holiday datetime Temp (°C) Wind Spd (km/h) Weather
0 scooter 2019-08-22 16 Thursday 4 338 129 DN-104 DN-104 -114.071462 ... 62.040324 0.183551 9.432558 1.731358 0 0 2019-08-22 16:00:00 19.4 11.0 Clear
1 scooter 2019-09-13 23 Friday 5 1092 347 DM-103 DM-103 -114.073762 ... 62.040324 0.056813 11.329107 0.643646 0 0 2019-09-13 23:00:00 9.7 9.0 Clear
2 scooter 2019-08-08 10 Thursday 4 2059 547 AL-37 AP-45 -114.255975 ... 1622.721148 0.788111 13.551005 10.679700 0 0 2019-08-08 10:00:00 20.1 10.0 Clear
3 scooter 2019-08-08 11 Thursday 4 158 228 DN-104 DN-104 -114.071462 ... 62.040324 0.392660 2.494737 0.979584 0 0 2019-08-08 11:00:00 22.2 20.0 Clear
4 scooter 2019-07-24 16 Wednesday 3 1009 308 CG-127 CF-128 -114.147194 ... 186.139286 0.184479 11.793506 2.175654 0 0 2019-07-24 16:00:00 21.0 36.0 Clear

5 rows × 23 columns
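
The derived columns in the sample appear to follow simple formulas. The sketch below is inferred from the first sample row rather than taken from the cleaning code, so treat the function and its argument names as hypothetical.

```python
def derived_metrics(trip_distance_m, trip_duration_s, aerial_distance_m):
    """Return (speed, a_speed, travel_efficiency) for one trip.

    Formulas are inferred from the sample rows, not taken from the cleaning code.
    """
    speed_kmh = trip_distance_m / trip_duration_s * 3.6          # m/s -> km/h
    aerial_speed_kmh = aerial_distance_m / trip_duration_s * 3.6
    travel_efficiency = aerial_distance_m / trip_distance_m
    return speed_kmh, aerial_speed_kmh, travel_efficiency


# First row of the sample table: 338 m in 129 s, aerial distance 62.040324 m
speed, a_speed, efficiency = derived_metrics(338, 129, 62.040324)
print(round(speed, 3), round(a_speed, 3), round(efficiency, 3))
```

The results agree with the table's speed (9.432558), a_speed (1.731358) and travel_efficiency (0.183551) for that row.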

Also, a quick look at some summary statistics for the dataset:

In [4]:
all_trips[['trip_distance', 'trip_duration']].describe()
Out[4]:
trip_distance trip_duration
count 482021.000000 482021.000000
mean 1846.298045 771.032598
std 1890.667017 809.441301
min 101.000000 31.000000
25% 639.000000 302.000000
50% 1261.000000 503.000000
75% 2335.000000 912.000000
max 56659.000000 9521.000000

So the average trip was about 1.8 km and took 12 min 51 s. That said, there are probably lots of long trips pulling up the average; the longest trip was 56.7 km!

The median is probably a better measure of typical use. Half of all trips were shorter than about 1.3 km and took less than about 8 min 23 s.
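
For reference, the minute/second figures come straight from the summary table. A small sketch of the conversion (the helper name is made up for illustration):

```python
def as_min_sec(seconds):
    """Split a duration in seconds into whole minutes and leftover seconds."""
    minutes, secs = divmod(round(seconds), 60)
    return minutes, secs


mean_duration = 771.032598    # mean trip_duration from describe()
median_duration = 503.0       # median (50%) trip_duration from describe()

print(as_min_sec(mean_duration))    # (12, 51)
print(as_min_sec(median_duration))  # (8, 23)
```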

The difference between bicycle trips and scooter trips doesn't appear to be appreciable. Note: for efficiency, the violin plots are built on a random sample of rides.

In [5]:
sample = all_trips.sample(frac=0.05) # plotting all 482k datapoints runs too slow

violin_fig1=px.violin(sample, x='vehicle_type', y='trip_distance', box=True, points="outliers",
          labels={'vehicle_type':'Vehicle', "trip_distance": "Trip Distance (m)"},
          title='Sample Distribution of Trip Distances for Bikes and Scooters')
violin_fig1.show()
In [6]:
violin_fig2=px.violin(sample, x='vehicle_type', y='trip_duration', box=True, points="outliers",          
          labels={'vehicle_type':'Vehicle', "trip_duration": "Trip Duration (s)"},
          title='Sample Distribution of Trip Durations for Bikes and Scooters')
violin_fig2.show()

Total Rentals

Before we get too carried away, let's see how many rentals there were per day over the time period:

In [7]:
line_fig1 = px.line(all_trips.groupby(['start_date','vehicle_type']).count().reset_index(), 
               x="start_date", y='a_dist', color='vehicle_type', 
               labels={'a_dist':'Number of Rentals/Day', 'start_date': 'Date', 'vehicle_type': 'Vehicle'},
               title = " Number of Vehicle Rentals per Day over Trial Period")
line_fig1.show()

A few things stand out:

  • There were a few scooter rentals before the official launch date of July 12th, maybe some testing or limited rentals.

  • There's a big jump in rentals towards the end of July. It's worth noting that Lime had 1000 scooters, starting July 12th (don't know if they were all available right away or if they added more), but Bird started operating their fleet of 500 scooters on July 26th.

E-Bikes are so 2018!

In [8]:
line_fig2 = px.line(all_trips[all_trips['vehicle_type'] == 'bicycle'].groupby(['start_date','vehicle_type']).count().reset_index(), 
               x="start_date", y='a_dist', color='vehicle_type', 
               labels={'a_dist':'Number of Rentals/Day', 'start_date': 'Date', 'vehicle_type': 'Vehicle'},
               title = "Number of E-Bike Rentals per Day over Trial Period")
line_fig2.show()

Because of this, I'm just going to focus on e-scooters for the remaining analysis.

How are People Using the Scooters?

Initially, I wanted to investigate what usage looks like for the scooters. Intuitively, we expect some periodicity in the rental patterns. For instance, there are probably fewer rentals in the middle of the night than during the day.

The following interactive plot shows the rentals per hour, over the entire trial period. Use the selectors to pick a time interval, and the slider to move the date range:

In [9]:
# Just pick Scooters (copy so columns can be added later without touching all_trips)
scooter = all_trips[all_trips['vehicle_type'] == 'scooter'].copy()
In [10]:
scooter2 = scooter.groupby(['datetime']).count().reset_index()


fig3 = go.Figure()
fig3.add_trace(go.Scatter(x=scooter2['datetime'],
                         y=scooter2['a_dist'].values.tolist(), 
               mode = 'lines',
               opacity = 1,
#                line = dict(color = '#17BECF'),
               name = 'Scooter Rentals'))
    
# Set title
fig3.update_layout(
    title_text="Number of Scooter Rentals per Hour",
    xaxis = dict(title = 'Date'),
    yaxis = dict(title = 'Rentals/hr')) 

# Add range slider
fig3.update_layout(
    xaxis=go.layout.XAxis(
        rangeselector=dict(
            buttons=list([              
                dict(count=1,
                     label="1d",
                     step="day",
                     stepmode="todate"),
                dict(count=2,
                     label="2d",
                     step="day",
                     stepmode="todate"),
                dict(count=7,
                     label="7d",
                     step="day",
                     stepmode="todate"),
                dict(count=14,
                     label="14d",
                     step="day",
                     stepmode="todate"),
                dict(count=1,
                     label="1m",
                     step="month",
                     stepmode="todate"),
                dict(count=2,
                     label="2m",
                     step="month",
                     stepmode="todate"),
                dict(step="all")
            ])
        ),
        rangeslider=dict(
            visible=True
        ),
        type="date"
    )
)

fig3.show()

It seems like most rentals occur during the middle of the day. There's a mini spike around 8AM on weekdays, likely corresponding to rides to work. The busiest period seems to be the afternoon and early evening.

If you scroll around, the busiest single hour was on September 21 at 5-6pm (hour 17, the hour mapped below). Not sure what was going on. Possibly the "Stampede Shindig" at Heritage Park? https://dailyhive.com/calgary/calgary-events-september-20-22-2019 Let's check a map:

In [11]:
print('Top hour for rentals was: ', str(scooter2.loc[scooter2.vehicle_type.idxmax(), 'datetime'])[:10])
Top hour for rentals was:  2019-09-21
In [12]:
# Set Mapbox Token (pathlib keeps the path portable and closes the file after reading)
px.set_mapbox_access_token((project_dir / 'data' / 'raw' / 'mapbox.token').read_text().strip())

peak_scooter = scooter[scooter['datetime'] == dt.datetime(2019,9,21,17)]

map1 = px.scatter_mapbox(peak_scooter, lat="endy", lon="endx", width=800, height=800, zoom=11, 
                         labels={'endy': "End Point Latitude", 'endx': "End Point Longitude"},
                         center = {'lat':50.98263, 'lon':-114.10210}, title='Rentals on Sept. 21, 2019: 5-6pm')
map1.show()

Not a single scooter trip ended at Heritage Park (the map is centered on its location).

They seem to mostly be situated downtown, so my guess is people were going to/from the Beakerhead Festival, which was also that weekend: https://www.visitcalgary.com/things-to-do/festivals/beakerhead. This is highly speculative.

Worth noting it was a nice night:

In [13]:
peak_scooter[['datetime','Temp (°C)', 'Wind Spd (km/h)', 'Weather' ]].head(1)
Out[13]:
datetime Temp (°C) Wind Spd (km/h) Weather
117916 2019-09-21 17:00:00 19.2 8.0 Clear

Rentals by Hour

Exploring the cyclical nature of the rentals some more, let's check if there are any interesting patterns in rentals based on time of day and whether it was a weekend, weekday or holiday.

The following plot shows average rentals per hour over the dataset for the different days.

In [14]:
# Format avg rentals/hr for weekend, holiday and weekdays
weekend_by_hour = scooter[scooter['is_weekend'] == 1].groupby('start_hour').count().reset_index().iloc[:,0:2]
holiday_by_hour = scooter[scooter['is_holiday'] == 1].groupby('start_hour').count().reset_index().iloc[:,0:2]
weekday_by_hour = scooter[(scooter['is_holiday'] == 0) & 
                          (scooter['is_weekend'] == 0)].groupby('start_hour').count().reset_index().iloc[:,0:2]

num_weekends = len(scooter[scooter['is_weekend'] == 1].groupby('start_date').count())
num_holidays = len(scooter[scooter['is_holiday'] == 1].groupby('start_date').count())
num_weekdays = len(scooter[(scooter['is_holiday'] == 0) & (scooter['is_weekend'] == 0)].groupby('start_date').count())
weekday_by_hour['name'] = "Weekday"
holiday_by_hour['name'] = "Holiday"
weekend_by_hour['name'] = "Weekend"
weekday_by_hour['vehicle_type'] = weekday_by_hour['vehicle_type'] / num_weekdays
holiday_by_hour['vehicle_type'] = holiday_by_hour['vehicle_type'] / num_holidays
weekend_by_hour['vehicle_type'] = weekend_by_hour['vehicle_type'] / num_weekends
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
by_hour = pd.concat([weekday_by_hour, holiday_by_hour, weekend_by_hour])

line_fig4 = px.line(by_hour, x='start_hour', y='vehicle_type', color = 'name',
                    title='Scooter Rentals per Hour Based on Day Type',
                    labels={'name': 'Day Type', 'vehicle_type': 'Avg. Scooter Rentals/hr', 'start_hour': 'Time of Day'}
                    )
line_fig4.show()

I think the most interesting observations from this plot are:

  • The little spike in rentals around the morning rush hour, but only on weekdays
  • The increased rentals between midnight and 4am on weekends and holidays. More on this below.
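
The averaging behind this plot can be sketched more compactly by labelling each trip with a single day type first. This is a toy version with made-up rows, not the notebook's exact method: here a holiday that falls on a weekend counts only as a holiday, which is an assumption.

```python
import pandas as pd

# Toy trips, one row per rental (illustrative values only)
trips = pd.DataFrame({
    "start_date": pd.to_datetime(["2019-08-13", "2019-08-13", "2019-08-14",
                                  "2019-08-17", "2019-08-17", "2019-08-17"]),
    "start_hour": [8, 8, 8, 1, 1, 14],
    "is_weekend": [0, 0, 0, 1, 1, 1],
    "is_holiday": [0, 0, 0, 0, 0, 0],
})

# Label each trip with one day type; holidays take priority over weekends here
trips["day_type"] = "Weekday"
trips.loc[trips["is_weekend"] == 1, "day_type"] = "Weekend"
trips.loc[trips["is_holiday"] == 1, "day_type"] = "Holiday"

# Rentals per (day type, hour), divided by the number of distinct days of that type
avg_per_hour = trips.groupby(["day_type", "start_hour"]).size().reset_index(name="rentals")
days_per_type = trips.groupby("day_type")["start_date"].nunique()
avg_per_hour["avg_rentals"] = avg_per_hour["rentals"] / avg_per_hour["day_type"].map(days_per_type)

print(avg_per_hour)
```

Dividing by the number of distinct days matters because there are far more weekdays than holidays in the dataset, so raw counts would not be comparable.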

Rentals by Time of Day and Day of Week

A heat map expands on the prior concept of time of the week impacting rentals. On weekdays we see more rentals around 8-9am than on weekends. We also see more rentals late at night on Fridays and Saturdays. I'm sure no one was "scooting" home from the bar...

In [15]:
week_order = {'start_day':['Monday','Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']}
px.density_heatmap(scooter, x='start_day', y='start_hour', category_orders=week_order, 
                   color_continuous_scale = 'plotly3', title='Rentals Based on Time of Day and Day of Week',
                   labels={'start_day':'Day of Week', 'start_hour':'Hour of Day'},
                   width=800, height=600)

Where are Trips Originating and Ending

We can view the starting and ending coordinates from each scooter rental.

While scooter trips start throughout most of the city, the rentals are definitely concentrated downtown.

In [16]:
# Plot starting point for all scooter trips
grid_count = scooter.groupby('starting_grid_id').count().reset_index().iloc[:,0:2]
grid_count.columns = ['starting_grid_id', 'rental_count']
# Select the numeric columns before averaging (recent pandas raises on non-numeric columns)
grid_loc = scooter.groupby('starting_grid_id')[['startx', 'starty', 'endx', 'endy',
                                                'trip_duration', 'trip_distance']].mean().reset_index()
grid_count = grid_count.merge(right=grid_loc, on='starting_grid_id')


px.scatter_mapbox(grid_count, lat='starty', lon='startx', color='rental_count',
                  zoom=9,  color_continuous_scale = 'plotly3',mapbox_style='dark',
                  width=800, height=800, title = 'All Trips: Originating Location',
                  labels = {'rental_count':'Total Rentals', 'starty': "Starting Latitude", 
                            'startx': "Starting Longitude"}
                  )
In [17]:
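# Count trips by the grid cell where they ended, not where they started
end_count = scooter.groupby('ending_grid_id').agg(rental_count=('vehicle_type', 'count'),
                                                  endx=('endx', 'mean'),
                                                  endy=('endy', 'mean')).reset_index()
px.scatter_mapbox(end_count, lat='endy', lon='endx', color='rental_count',
                  zoom=9,  color_continuous_scale = 'plotly3',mapbox_style='dark',
                  width=800, height=800, title = 'All Trips: Finishing Location',
                  labels = {'rental_count':'Total Rentals', 'endy': "Ending Latitude", 
                            'endx': "Ending Longitude"}
                  )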

Starting Location vs Date

The following animation shows where trips are originating over the entire trial period. The odd scooter makes its way out of the core, but ultimately that's where most rentals are originating from.

In [18]:
grid_date = scooter.groupby(['start_date','starting_grid_id']).count().reset_index().iloc[:,0:3]
# Select the numeric columns before averaging (recent pandas raises on non-numeric columns)
grid_loc = scooter.groupby(['start_date','starting_grid_id'])[['startx', 'starty', 'endx', 'endy',
                                                               'trip_duration',
                                                               'trip_distance']].mean().reset_index()
grid_date = grid_date.merge(grid_loc, on=['start_date', 'starting_grid_id'])
grid_date.columns = ['start_date', 'starting_grid_id', 'rental_count', 'startx', 'starty', 'endx', 'endy',
                    'trip_duration', 'trip_distance']
grid_date['start_date'] = grid_date['start_date'].apply(lambda x: x.strftime("%d-%b-%Y"))

px.scatter_mapbox(grid_date, lat='starty', lon='startx', color='rental_count',
                  zoom=9,  color_continuous_scale = 'plotly3',mapbox_style='dark',
                  width=800, height=800, title = 'All Trips: Originating Location by Date',
                  labels = {'rental_count':'Total Rentals', 'starty': "Starting Latitude", 
                            'startx': "Starting Longitude", 'start_date':'Date',
                            'trip_duration': "Trip Time (s)", 'trip_distance': 'Trip Distance (m)'},
                  hover_data=['trip_duration', 'trip_distance'],
                  animation_frame = 'start_date'
                  )
In [19]:
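# Count trips by the date and grid cell where they ended, not where they started
end_date = scooter.groupby(['start_date', 'ending_grid_id']).agg(rental_count=('vehicle_type', 'count'),
                                                                 endx=('endx', 'mean'), endy=('endy', 'mean'),
                                                                 trip_duration=('trip_duration', 'mean'),
                                                                 trip_distance=('trip_distance', 'mean')).reset_index()
end_date['start_date'] = end_date['start_date'].dt.strftime("%d-%b-%Y")

px.scatter_mapbox(end_date, lat='endy', lon='endx', color='rental_count',
                  zoom=9,  color_continuous_scale = 'plotly3',mapbox_style='dark',
                  width=800, height=800, title = 'All Trips: Final Location by Date',
                  labels = {'rental_count':'Total Rentals', 'endy': "Ending Latitude", 
                            'endx': "Ending Longitude", 'start_date':'Date',
                            'trip_duration': "Trip Time (s)", 'trip_distance': 'Trip Distance (m)'},
                  animation_frame = 'start_date',
                  hover_data=['trip_duration', 'trip_distance'],
                  )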

Is This a Viable Business?

Another question I had was how much money the scooters could possibly be making. While I don't have any insights into the business model, we can at least estimate how much revenue the scooters generate.

Info on pricing wasn't available on Lime's website, but I found an article at https://dailyhive.com/calgary/lime-scooters-calgary-how-to-guide-2019 that mentions \$1 for the first min and \$0.30 per minute thereafter. Further analysis assumes that all trips followed this pricing and that all trips were paid in full, i.e. no discounts or promotions. This isn't going to be accurate, but it's about the best I can do.

In [20]:
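def scooter_revenue(trip_time):
    """Calculates scooter revenue ($) as a function of trip time (s) assuming $1 to start and 
    $0.30/min thereafter"""
    # Note: * and // have equal precedence, so the per-minute charge (0.3 * trip_time / 60)
    # is rounded down to a whole dollar before the $1 start fee is added
    return (0.3 * trip_time) // 60 + 1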
In [21]:
# Calc revenue for all scooter trips
scooter['trip_cost'] = scooter['trip_duration'].apply(scooter_revenue).values
In [22]:
hist_fig1 = px.histogram(scooter.sample(frac=0.05), x='trip_cost', histnorm='probability', marginal = 'box',
             title='Sample Distribution of Cost of Scooter Rentals (Fraction of Rentals)', 
             labels={'count': 'Percent of Total Rentals', 'trip_cost':'Total Cost of Trip ($)'})
hist_fig1.show()

Summary Statistics for Trip Cost

In [23]:
print(scooter.trip_cost.describe())
count    464743.000000
mean          4.369383
std           4.074139
min           1.000000
25%           2.000000
50%           3.000000
75%           5.000000
max          48.000000
Name: trip_cost, dtype: float64

Total Projected Revenue:

In [24]:
print(scooter.trip_cost.sum())
2030640.0

Probably as expected, the distribution of trip cost is right skewed, with a median trip cost of about \$3/trip and a mean cost of \$4.37/trip.

Total estimated revenue was \$2,030,640 over a 3 month period! And as we saw above, they weren't even fully operational over those three months. I have no insights into the business model, but that's a lot more than I was expecting. I wonder if someone actually paid \$48 for a scooter trip!
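
One caveat on the estimate: the expression in scooter_revenue rounds the per-minute charge down to a whole dollar, which is why every trip cost in the summary is a whole number. A minimal sketch comparing it with one alternative reading of the pricing, \$0.30 for every started minute, is below. That alternative billing rule is an assumption, not something confirmed by Lime or the article.

```python
import math


def revenue_floor_dollar(trip_seconds):
    """The notebook's formula: $1 plus the per-minute charge rounded down to a whole dollar."""
    return (0.3 * trip_seconds) // 60 + 1


def revenue_per_started_minute(trip_seconds):
    """Assumed alternative: $1 to start plus $0.30 for every started minute."""
    return 1 + 0.30 * math.ceil(trip_seconds / 60)


median_trip = 503  # seconds, the median duration across all trips
print(revenue_floor_dollar(median_trip))                  # 3.0
print(round(revenue_per_started_minute(median_trip), 2))  # 3.7
```

Under the per-started-minute reading a typical trip costs somewhat more, so the total above is more likely an underestimate than an overestimate, all else being equal.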

Weather Impacts

One previously stated goal was to study the impact of weather on scooter rentals. It seems intuitive that the weather should impact the number of scooters rented, although you could probably predict rentals pretty well just by using the time of day and day of the week. (An exercise for future work.)

The first chart compares, for each weather type, the fraction of all scooter rentals with the fraction of all hours in the dataset. Blue is the fraction of scooter rentals and red is the fraction of 'hours' that showed that weather type.

For instance, it was clear 43% of the time in the dataset, but 47% of scooters were rented when it was clear. Conversely, it rained 10% of the time, but only 8% of rentals happened when it was raining.

It didn't snow much over the trial period, but there were very few rentals when it did snow. Worth confirming, but it's possible that the scooters were actually removed from operation when it snowed in September.

In [25]:
scooter_count = scooter.groupby('Weather').count().reset_index().iloc[:,0:2]
scooter_count.columns = ['Weather', 'Rentals']
weather_count = scooter[['datetime','Weather']].drop_duplicates().groupby('Weather').count().reset_index()
weather_count.columns = ['Weather', 'Hours']
total_count = weather_count.merge(scooter_count, on="Weather")
total_count['Hours']  = total_count['Hours'] / total_count['Hours'].sum()
total_count['Rentals']  = total_count['Rentals'] / total_count['Rentals'].sum()
total_count = total_count.melt(id_vars = 'Weather', value_name = 'Percentage of Total', var_name = 'Category')

# Percentage of rentals with that weather vs percentage of hours with that weather
px.bar(total_count, x = 'Weather', y ='Percentage of Total', color = 'Category', barmode = 'group', opacity=1,
       title='Rentals vs Weather')
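
One way to read the comparison described above is as a ratio of the two shares. The sketch below uses only the percentages quoted in the text; the category names are illustrative.

```python
# Shares quoted in the text: fraction of hours vs fraction of rentals
share_of_hours = {"Clear": 0.43, "Rain": 0.10}
share_of_rentals = {"Clear": 0.47, "Rain": 0.08}

# A ratio above 1 means the weather type drew more than its share of rentals
ratios = {
    weather_type: share_of_rentals[weather_type] / share_of_hours[weather_type]
    for weather_type in share_of_hours
}

for weather_type, ratio in ratios.items():
    print(f"{weather_type}: {ratio:.2f}")
```

Clear hours come out at roughly 1.09 and rainy hours at 0.80, i.e. rainy hours saw about 20% fewer rentals than their share of time would suggest.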

Comparing temperature to number of rentals, it looks like there are more rentals when it's warmer, but the data also clusters around time of day, i.e. it doesn't matter if it's 20$^{\circ}C$ at midnight, there won't be many rentals.

In [26]:
# Select the numeric columns before averaging (recent pandas raises on non-numeric columns)
temperature_df = scooter.groupby('datetime')[['Temp (°C)', 'Wind Spd (km/h)',
                                              'start_hour']].mean().reset_index()
rentals_per_hr = scooter.groupby('datetime').count().reset_index().iloc[:,0:2]
rentals_per_hr.columns = ['datetime', 'count']
rentals_per_hr = rentals_per_hr.merge(temperature_df, on='datetime')
rentals_per_hr.datetime = rentals_per_hr.datetime.apply(lambda x: x.strftime("%d-%b-%Y"))
px.scatter(rentals_per_hr, y='count', x='Temp (°C)', color='start_hour', color_continuous_scale = 'plotly3',
           hover_data = {'datetime'}, labels={'datetime':'Date'})

A more interesting plot shows just the rentals vs temperature at 4pm. Here there is more of a positive trend.

In [27]:
px.scatter(rentals_per_hr[rentals_per_hr.start_hour == 16], y='count', x='Temp (°C)',
           hover_data = {'datetime'}, labels={'datetime':'Date'})

8am however looks more like random scatter. You probably don't care about temperature when deciding if you're riding a scooter to work. Note: Weekends aren't broken out here.

In [28]:
px.scatter(rentals_per_hr[rentals_per_hr.start_hour == 8], y='count', x='Temp (°C)',
           hover_data = {'datetime'}, labels={'datetime':'Date'})

Weather impact on rentals superficially looks as expected: more rentals when it's nice, and fewer when it's not. Don't expect the scooters to operate over the winter. More analysis could be done to actually quantify the weather impact on rentals.

Rider Types

The last thing I wanted to do was investigate if I could, at a high level, attempt to classify the types of rides that are happening on the scooters.

I personally witnessed lots of people grabbing the scooters and more or less, "Taking them for a spin," with no real purpose in mind other than to try them out.

The city commissioned a survey and published that one in three trips replaced a car. I'd like to see how plausible that is with the data. https://www.cbc.ca/news/canada/calgary/calgary-e-scooter-report-1.5396846

When cleaning the data, I added a column for "aerial distance", which is basically the straight-line distance between the trip starting point and ending point. Those coordinates are anonymized, so the actual start and end points could each be up to ~62m from the point in the dataset. So the actual aerial distance traveled is within +/- ~124m of the value in the dataset.
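
The ~62m figure follows from the hexagon size if the anonymizing cells are regular hexagons (an assumption). The sketch below also includes a haversine function as one common way to compute a straight-line distance from lat/lon pairs. It is not necessarily how the a_dist column was computed.

```python
import math

HEX_AREA_M2 = 10_000  # area of each anonymizing hexagon, from the dataset description


def hexagon_circumradius(area_m2):
    """Centre-to-corner distance of a regular hexagon with the given area."""
    return math.sqrt(2 * area_m2 / (3 * math.sqrt(3)))


def haversine_m(lat1, lon1, lat2, lon2, earth_radius_m=6_371_000):
    """Great-circle distance in metres between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * earth_radius_m * math.asin(math.sqrt(a))


radius = hexagon_circumradius(HEX_AREA_M2)
print(round(radius, 2))      # 62.04
print(round(2 * radius, 1))  # 124.1
```

The 62.04 m radius matches the a_dist value shown in the sample table for trips that start and end in the same grid cell.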

I used Principal Component Analysis (PCA) on distance traveled, trip time and aerial distance to see if any interesting observations emerged.

In [29]:
scooter_sample = scooter.sample(frac=0.01) # Use sample so points are actually visible

# Columns for pca
pca_cols=['trip_distance', 'trip_duration', 'a_dist'] 

# Scale data and convert back to a DataFrame
scale = StandardScaler()
df_scaled = pd.DataFrame(scale.fit_transform(scooter_sample[pca_cols]), columns=pca_cols)

# Run PCA on the feature set dataframe
pca = PCA(n_components = 2)
principal_components = pca.fit_transform(df_scaled)

# Stick back into a DataFrame and rescale each component to unit variance for plotting
df_pca = pd.DataFrame(StandardScaler().fit_transform(principal_components), columns=['pc1', 'pc2'])
In [30]:
# Plot using the Principal Components as Axes
# (recent seaborn versions require keyword arguments here)
sns.lmplot(data=df_pca, x='pc1', y='pc2', fit_reg=False, height=8)

# loadings of each original feature on the first two PCs;
# these set the direction of the arrow drawn for each **original feature**
xvector = pca.components_[0]
yvector = pca.components_[1]
 
# value of the first two PCs, used to scale the arrows
xs = pca.transform(df_scaled)[:,0]
ys = pca.transform(df_scaled)[:,1]

# arrows project features (columns from csv) as vectors onto PC axes
for i in range(len(xvector)):
    plt.arrow(0, 0, xvector[i]*max(xs), yvector[i]*max(ys),
              color='r', width=0.005, head_width=0.05)
    plt.text(xvector[i]*max(xs)*1.1, yvector[i]*max(ys)*1.1,
             list(scooter_sample[pca_cols].columns.values)[i], color='r')

plt.annotate("Productive Trips", xy=(6,6)) 
plt.annotate('"Joy Rides!"', xy=(4,-4)) 
plt.title('PCA of Scooter Trip Data')
plt.show()

I called trips migrating towards the top right of this plot "Productive" trips as the aerial distance increases as trip distance increases. Trips in the lower half of the chart I call "Joy Rides" as the trip duration and distance is increasing, but the aerial distance is relatively low. This would represent a trip where someone started and ended at roughly the same place.

As expected, most trips are actually relatively short in duration and distance.
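
Since the plot keeps two of three possible components, it can be useful to check how much variance they capture. Below is a NumPy-only sketch of the same steps (standardize, decompose, project) on synthetic stand-in data. The generated numbers are not real trips and the distributions are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for trip_distance, trip_duration and a_dist (not real data)
distance = rng.gamma(shape=2.0, scale=900.0, size=500)
duration = distance / 3.5 + rng.normal(0, 60, size=500)
aerial = distance * rng.uniform(0.05, 0.95, size=500)
features = np.column_stack([distance, duration, aerial])

# Standardize each column to zero mean and unit variance, as StandardScaler does
scaled = (features - features.mean(axis=0)) / features.std(axis=0)

# PCA via eigen-decomposition of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(scaled, rowvar=False))
order = np.argsort(eigenvalues)[::-1]            # largest variance first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained_ratio = eigenvalues / eigenvalues.sum()
scores = scaled @ eigenvectors[:, :2]            # coordinates on pc1 and pc2

print(explained_ratio.round(3))
```

With scikit-learn, the same quantity is available from the fitted model as pca.explained_variance_ratio_.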

A metric, "Trip Efficiency" (the travel_efficiency column), is calculated as the ratio of aerial distance to measured trip distance. Theoretically the maximum for this metric should be 1, but due to the inaccuracy of the start and end points, sometimes it is >1. In theory the ratio could also be >1 if the scooter was being carried, taken on the train, etc.

Below is a histogram of trip efficiency for a sample of scooter rentals:

In [31]:
hist_fig2 = px.histogram(scooter.sample(frac=0.05), x='travel_efficiency', histnorm='probability', marginal = 'box',
                         title='Sample Distribution of Travel Efficiency of Scooter Rentals (Fraction of Rentals)', 
                         labels={'travel_efficiency':'Aerial Distance/Trip Distance'},
                         nbins=100)
hist_fig2.show()

We see that the most popular range for trip efficiency is in the 0.4-0.9 range, which is probably about as expected if you were actually using the scooter to go somewhere.

That said there are also a lot of trips with low travel efficiencies. I'd speculate these were more "just for fun" rides.

Classifying Trips

The original question was: What fraction of trips could plausibly have replaced a trip with a car?

I'll use some completely made up qualifiers to decide if the trip could have replaced a car. We'll assume your average millennial (did I mention I'm making this up) walks at 1.35m/s https://www.healthline.com/health/exercise-fitness/average-walking-speed#average-speed-by-age.

If you don't like my assumptions, feel free to substitute your own.

Qualifiers are:

1) Trip distance must be >810m, which is about a 10 min walk. Yes, some people take cars for shorter trips (we call them lazy), but indulge me.
2) Travel efficiency must be >0.3. If you're meandering more than that, I'm guessing you're probably just out for a ride.

That gives me approximately half of trips that could possibly have replaced a car, so the 1 in 3 seems plausible.

In [32]:
# Calculation for above
car_test = scooter[(scooter.trip_distance > 810) & 
                   (scooter.travel_efficiency>0.3)].start_date.count()/scooter.start_date.count()
print("Fraction of trips that could have replaced a car: ", round(car_test,3)*100, "%")
Fraction of trips that could have replaced a car:  49.9 %
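
The same two qualifiers can be written as a small per-trip function, shown here on the five sample rows from the table near the top. The function and constant names are made up for illustration.

```python
WALKING_SPEED_M_S = 1.35                      # assumed walking speed from the text
MIN_DISTANCE_M = WALKING_SPEED_M_S * 10 * 60  # a 10-minute walk, about 810 m
MIN_EFFICIENCY = 0.3


def could_replace_car(trip_distance_m, travel_efficiency,
                      min_distance_m=MIN_DISTANCE_M, min_efficiency=MIN_EFFICIENCY):
    """Apply the two qualifiers: long enough and direct enough."""
    return trip_distance_m > min_distance_m and travel_efficiency > min_efficiency


# (trip_distance, travel_efficiency) from the five sample rows shown earlier
sample_rows = [(338, 0.183551), (1092, 0.056813), (2059, 0.788111),
               (158, 0.392660), (1009, 0.184479)]
flags = [could_replace_car(d, e) for d, e in sample_rows]
print(flags)  # [False, False, True, False, False]
```

Only the third sample trip (about 2 km at an efficiency of 0.79) passes both tests. The second and fifth are long enough but too meandering.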

If you don't like my assumptions for minimum distance and efficiency threshold, feel free to use the following chart to look up the fraction of trips that could have replaced a car, based on your own assumptions.

In [33]:
# do it on a range of inputs

lazy_threshold = np.arange(100, 1600, 100).tolist()
car_trips = []
travel_eff = np.arange(.1, 1, 0.1).tolist()


for thresh in lazy_threshold:
    for eff in travel_eff:
        car_trips.append(scooter[(scooter.trip_distance > thresh) & 
                            (scooter.travel_efficiency>eff)].start_date.count()/scooter.start_date.count())
        
car_df = pd.DataFrame(zip(travel_eff*len(lazy_threshold),
                      [item for item in lazy_threshold for i in range(len(travel_eff))], car_trips))

line_fig5 = px.line(car_df, x=0, y=2, animation_frame=1, title = "Fraction of Trips That Could Replace a Car",
        labels={"0":"Minimum Travel Efficiency", "1":"Minimum Distance Threshold", "2":"Fraction of Trips"})
line_fig5.show()

Bonus Analysis

According to the City of Calgary's website, the maximum speed of the scooters is 20km/h. https://www.calgary.ca/Transportation/TP/Pages/Cycling/Cycling-Strategy/Shared-electric-scooter-pilot.aspx?redirect=/scootershare

Looking at the data it appears that some people were able to achieve higher average speeds in practice.

In [34]:
hist_fig3 = px.histogram(scooter.sample(frac=0.05), x='speed', nbins=100, histnorm='probability', marginal = 'box',
                         title='Sample Distribution of Average Scooter Speed (Full Trip)', 
                         labels={'count': 'Percent of Total Rentals', 'speed':'Average Speed (km/h)'},
                         range_x=(0,40))
hist_fig3.show()

Conclusions

It's hard to argue that the e-scooters weren't popular in Calgary. It's no surprise that they will continue next year, while e-bike rentals will not return (at least for Lime).

Rentals cover much of the city with most in the central downtown area.

We don't know all the details of the business model, but the revenue potential is certainly there. While there were certainly many novelty rides, it does look like people were using the scooters to actually travel places. This bodes well for the sustainability of the business model.
